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Introduction 


Imagine sitting at a café (or “other” public venue) and you hear a song 
playing on the stereo. You decide that you really like it, but you don’t know 
the name of the song. There’s a solution for that. Software song 
identification has been a topic of interest for years. However, it is 
computationally difficult to tackle this problem using conventional 
algorithms. Frequency analysis provides for a fast and accurate solution to 
this problem, and we decided to use this analysis to come up with a fun 
project idea. The main purpose of our project was to be able to accurately 
match a noisy song segment with a song in our song library. The company 
Shazam was our main inspiration and we started out by studying how 
Shazam works. 


The Fingerprint of a Song 


Just like how every individual has a unique fingerprint that can be used to 
distinguish one person from another, our algorithm creates a digital 
fingerprint for each song that can be used to distinguish two songs. The 
song’s fingerprint consists of list of time-frequency pairs that uniquely 
represent all the significant peaks in the song’s spectrogram. To assure 
accurate matching between two fingerprints, our algorithm needs to take 
into account the following issues when choosing peaks for the fingerprint: 


¢ Uniqueness — The fingerprint of each song needs to be unique to that 
one song. Fingerprints of different songs need to be different enough to 
be easily distinguished by our scoring algorithm. 

e Sparseness — The computational time of our matched filter depends on 
the amount of data in each song’s fingerprint. Thus each fingerprint 
needs to sparse enough for fast results, but still contain enough 
information to provide accurate matches. 

¢ Noise Resistant — Song data may contain large amounts of 
background noise. The fingerprinting algorithm must be able to 
differentiate between the signal and added noise, storing only the 
signal information in the fingerprint. 


These criteria are all met by identifying major peaks in the song’s 
spectrogram. The following section describes the fingerprinting algorithm 
in more detail. 


The Fingerprint Finding Algorithm 


Filtering and Resampling 


After the song data is imported, the signal is then resampled to 8000 
samples per second in order to reduce the number of columns in the 
spectrogram. This will speed up later computations but still leaves enough 
resolution in the data for accurate results. 


Then the data is high-pass filtered using a 30" order filter with a cutoff 
frequency around 2KHz (half the bandwidth of the resampled signal). 
Filtering is used because the higher frequencies in songs are more unique to 
each individual song. The bass, however, tends to overshadow these 
frequencies, thus the filter is used make fingerprint include more high 
frequencies points. Testing has shown that the algorithm has a much easier 
time distinguishing songs after they are high-pass filtering. 


The Spectrogram 


The spectrogram of the signal is then taken in order to view the frequencies 
present in each time slice. The spectrogram below is from a 10 second 
noisy recording. 


Spectrogram 
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The effect of the low-pass filter is clearly 
visible in the spectrogram. However, local 
maxima in the low frequencies still exist 
and will still show up in the fingerprint. 


Each vertical time slice in the bin is then analyzed for prominent local 
maxima as described in the next section. 


Finding the Local Maxima 


In the first time slice, the five greatest local maxima are stored as points in 
the fingerprint. Then a threshold is created by convolving these five 
maxima with a Gaussian curve, creating a different value for the threshold 
at each frequency. An example threshold is shown in the figure below. The 
threshold is used to spread out the data stored in the fingerprint, since peaks 
that are close in time and frequency are stored as one point. 
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The initial threshold, formed by 
convolving the peaks in the first time 
slice with a Gaussian curve. 


For each of the remaining time slices, up to five local maxima above the 
threshold are added to fingerprint. If there are more than five maxima, then 
the five greatest in amplitude are chosen. The threshold is then updated by 
adding new Gaussian curves centered at the frequencies of the newly found 
peaks. Finally the threshold is scaled down so that it decays exponentially 
over time. The following figure shows how the threshold changes over 


time. 


Change of Threshold over Time 
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The threshold increases whenever a new 
peak is formed around that peak’s 
frequency and decays exponentially over 
time. 


The final list of the time and frequencies of the local maxima above the 
threshold are returned as the song’s fingerprint. 


The Resulting Fingerprint 


The following is the fingerprint of the sample signal from the examples 
above. 
Spectrogram Peaks 
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The fingerprint of the 10 second segment 
from the previous examples 


From the graph, it is easy to see patterns and different notes in the song. 
Lets see how the algorithm addresses the three issues identified in the first 
paragraph: 


e Uniqueness — The algorithm only stores the prominent peaks in the 
spectrogram. Different songs have a different pattern of peaks in 
frequency and time, thus each song will have a unique fingerprint. 

e Sparseness — The algorithm only picks up at most five peaks per time 
Slice. This limits the number of peaks in the resulting fingerprint. The 
threshold spreads out the positions of peaks so that the fingerprint is 
more representational of the data. 

e Noise Resistant — Unless the background noise is loud enough to 
create peaks greater than the peaks present in the song, then very little 
noise will show up in the fingerprint. Also, a ten second segment has 
around 6000 data points, so a matched filter will be able to detect a 


match between two fingerprints, even with a reasonable amount of 
added noise. 


The next section will detail the process used to compare the fingerprint of 
the song segment to the fingerprints of the songs in the library. 


Matched Filter for Spectrogram Peaks 


In order to compare songs, we can generate match scores for them using a 
matched filter. We wanted a filter capable of taking the spectral peaks 
information generated by the fingerprint finding algorithm for two different 
songs and produce a single number that would tell us how much the two 
songs being compared look alike. We wanted this filter to be as insensitive 
as possible to noise and produce a score that is independent of the length of 
each recording. 


Our approach to this was completely different from that used by the creators 
of Shazam, as we did not use Hash tables at all and did not combine the 
peaks into pairs limited by certain regions, as they did. In the end we still 
managed to get very good accuracy and decent performance by using a 
matched filter. 


The Matched Filter Algorithm 


Preparation 


Before filtering, we take the lists of spectral peaks that is the output of the 
landmarks generator algorithm and generate matrices that are the same size 
as the spectrograms, with the peaks replaced by 1’s in their respective 
positions and all other points replaced by 0’s. At some point during our 
project we had the idea of convolving this matrix with a Gaussian curve, in 
order to allow peaks to match somewhat if they were shifted only slightly. 
However, we later determined that even a very small Gaussian would 
worsen our noise resistance, so this idea was dropped. So basically now we 
have one map for each song that shows the position in time and frequency 
bins of all peaks. Next we normalize these matrices using their Frobenius 
norm. This ensures that the final score is normalized. Then we apply the 
matched filter which basically consists of flipping one of the matrices and 
convolving them, which is done by zero padding them both to the proper 
size and multiplying their 2D FFT’s, for speed. The result is a cross 
correlation matrix, but we still need to extract a single number from it to be 
our match score. 


Extracting Information from the Cross Correlation Matrix 


Through much testing, we determined that the most accurate and noise- 
resistant measure of the match was simply taking the global maximum of 
the result. Other approaches that we tried, such as taking the trace of the 
X!X or the sum of the global maxima for each row or column, had much 
more frequent mismatches. Taking just the global maximum of the whole 
matrix was simple and extremely effective. 


When looking at test results, however, we saw that the score still had a 
certain dependency on the size of the segments being compared. Through 
more testing, we determined that this dependency looked approximately 
like a dependency on the square root of the ratio of the lower number of 
peaks by the higher number of peaks, when testing with a noiseless 
fragment of a larger song. This can be seen in this plot: 


Match Scores Dependency on Number of Peaks 
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Number of Peaks 


A plot showing the score of a song fragment that 
should perfectly match the song it was taken 
from, seen without correcting the square root 

dependency mentioned above 


In the plot above, the original segment has 6915 peaks and the fragment 
was tested with between 100 and 5000 peaks, in intervals of 100. Since 
smaller sample sizes usually lead to having fewer peaks, we had to get rid 
of this dependency. To prevent the square root growth of the scores, the 
final score is multiplied by the inverse of this square root, yielding a match 
score that is approximately independent of sample size. This can be seen in 
the next stem plot, made with the same segments as the first: 


Match Scores After 1/sqrt Normalization 
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The same plot shown before, but with the square 
root dependency on number of peaks removed 


So clearly this allows us to get better match scores with small song 
segments. After this process, we had a score that was approximately 
independent of segment size, normalized and could tell apart matches and 
mismatches, even with lots of noise. All that was left was to test it against 
different sets of data and set a threshold for distinguishing between matches 
and non-matches. 


Setting a Threshold 


The filter’s behavior proved to be very consistent. Perfect matches (trying 
to match a segment with itself) always got scores of 1. Matching noiseless 
segments to the whole song usually yielded scores in the upper .8’s or in the 
.9’s, with a few rare exceptions that could have been caused by a bad choice 
of segment, such as a segment with a long period of silence, for example. 
Noisy segments usually gave us low scores such as in the .1’s, but more 


importantly mismatches were even lower, in the .05’s to .07’s or so. This 
allowed us to set a threshold for determining when we have a match or not. 


During our testing, we considered using a statistical approach to set the 
threshold. For example, if we wanted a 95% certainty that a song matched, 
we could require the highest match score to be greater than 1.66*[o/sqrt(n)] 
+ y, where o is the standard deviation, n is the sample size and p is the 
mean. However, with our very small sample size, this threshold seemed to 
yield inaccurate results, so the simple threshold criterion of the highest 
match having to be at least 1.5 times the second highest in order to be 
considered a match was used. 


Similarities and Differences from Shazam’s Approach 


Even though we followed the ideas in the paper by Wang, we still had some 
significant differences from the approach used by Shazam. We followed the 
ideas they had for fingerprint creation, to a certain extent, however the 
company uses hash tables instead of matched filters to perform the 
comparison. While evidently faster than using a matched filter, hash tables 
are not covered in ELEC 301. Furthermore, when making a hash, Wang 
says they combine several points in an area with an anchor point and pair 
them up combinatorially. This allows the identification of a time offset to be 
used with the hash tables and makes the algorithm even faster and more 
robust. Perhaps investigating this would be an interesting extension of the 
project, if we had more time. 


Results 


The final step in the project was to test the algorithm we had created so we 
went ahead and conducted a series of tests that would evaluate mostly 
correctness but also, to some extent, performance. 


Testing 


First, we wanted to test to make sure that our algorithm was working 
properly. To do this, we attempted to match short segments of the original 
song (i.e. “noiseless”, actual copies of the library songs) of approximately 
ten seconds in length. The table below shows how these original clips 
matched. The titles from left to right are song segments, and titles running 
from top to bottom are library songs. We abbreviated them from the 
original, so they would fit in the matrix. The original names are “Stop this 
Train”, by John Mayer, “Semi-Charmed Life”, by Third Eye Blind, “I’ve 
got a Feeling” by Black Eyed Peas, “Love Like Rockets”, by Angels and 
Airwaves, “Crash Into Me”, by Dave Matthews Band and “Just Another 
Day in Paradise”, by Phil Vassar. 


Noiseless Recordings 
Life Feeling Rockets Crash Paradise 
Life 
Feeling 
Rockets 
Crash | 0.1256] 0.0593] 0.0967] 0.0649 
Paradise 


Train 


This matrix shows the match score results of the six 
noiseless recordings made from fragments of songs in 
the database, each of them compared to all songs in 
the database 


The clear matches with highest scores can be seen along the diagonal. Most 
of these are close to 1, and each match meets our criteria of being 1.5 times 
greater than the other scores (comparing horizontally.) This was a good test 
that we were able to use to modify our algorithm and try different 
techniques. Ultimately, the above results showed that our code was 
sufficient for our needs. 


We then needed to see if our code actually worked with real world (noisy) 
song segments. Songs were recorded on an iPhone simultaneously with 
various types of noise as follows: Train- low volume talking, Life- loud 
recording (clipping), Crash- typing, Rockets- repeating computer error 
noise, Feeling- Gaussian noise (added in Matlab to wav file), and Paradise- 
very loud talking. There were two additional songs we used in this test to 
check for robustness and proper matching. One is a live version of Crash, 
which includes a lot of crowd noise but does not necessarily have all the 
identical features of the original Crash fingerprint. The other additional 
song, “Yellow”, by Coldplay, is a song that is not in our library at all. 


Noisy Recordings Using an iPhone 

Life Feeling Rockets Crash naree ee Crash fixel Yellow 
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Crash | 0.0675| 0.0631) 0.0733] 0.0741] 0.0668] 0.0682 0.071 
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This matrix shows the match score results of the six noisy 
recordings made from fragments of songs in the database, plus 
a live version of a song in the database and another song 
entirely not in the database 


Again, the clear matches are highlighted in yellow along the diagonal. The 
above results show that our algorithm can still accurately match the song 
segments in more realistic conditions. The graph below shows more 
interesting results. 


Score of Noisy Song Segments Against Songs 
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This plot is a visual representation of the 
results matrix seen above 


Conclusions 


As before, the matches in the first six songs (from left to right) are obvious, 
and Yellow does not show any clear correlation to any library song, as 
desired, but the live version of Crash presents an interesting question. Do 
we actually want this song to match? Since we wanted our fingerprinting 
method to be unique to each song and song segment, we decided it would 
be best to have a non-match in this scenario. However, if one observes 
closely, it can be seen that the closest match (though it is definitely not 
above the 1.5 mark) is, in fact, matching to the original Crash. This emerges 
as a small feature of our results. This small “match” says that although we 
may not match any songs in the library, we can tell you that this live version 


most resembles the original Crash version, which may be a desirable 
outcome if we were to market this project. 


We were amazed that the final filter could perform so well. The idea of 
completely ignoring amplitude information in the filter came from the paper 
by Avery Li-Chun Wang, one of Shazam’s developers. As he mentions, 
discarding amplitude information makes the algorithm more insensitive to 
equalization. However, this approach also makes it more noise resistant 
since, since what we do from there on basically consists of counting 
matching peaks versus non-matching peaks. Any leftover noise will count 
very little towards the final score, as the number of peaks per area in the 
spectrogram is limited by the thresholding algorithm and all peaks have the 
Same magnitude in the filter. 
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